
where RBConv denotes the convolution operation implemented as a new module, $F^l_{in}$ and $F^l_{out}$ are the feature maps before and after the convolution, respectively, $W^l$ are the full-precision filters, the values of $\hat{W}^l$ are $+1$ or $-1$, and $\odot$ is the element-wise product operation.
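To make the module concrete, here is a minimal PyTorch-style sketch of such a rectified binary convolution. The class name `RBConv`, the shape chosen for $C^l$ (one learnable value per kernel position, shared across a layer's filters), and the hard sign-style binarization are illustrative assumptions, not the reference implementation; training additionally needs the gradient approximation of Eq. (3.68) below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RBConv(nn.Module):
    """Sketch: binarize the full-precision filters W^l to +1/-1 and rescale
    them element-wise with a learnable matrix C^l before the convolution."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        # Full-precision filters W^l (kept and updated during training).
        self.W = nn.Parameter(0.01 * torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        # Learnable rectification matrix C^l (assumed shape: kernel_size x kernel_size).
        self.C = nn.Parameter(torch.ones(kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # \hat{W}^l: binarized filters whose values are +1 or -1.
        w_hat = torch.where(self.W >= 0, torch.ones_like(self.W), -torch.ones_like(self.W))
        # Element-wise product of C^l and \hat{W}^l, then an ordinary convolution.
        return F.conv2d(x, self.C * w_hat, stride=self.stride, padding=self.padding)
```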

During the backward propagation of RBCNs, both the full-precision filters $W$ and the learnable matrices $C$ need to be learned and updated. These two sets of parameters are learned jointly: in each convolutional layer, we update $W$ first and then $C$.
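As a rough, framework-agnostic sketch of this update order, one iteration for a single layer could be structured as follows; `compute_delta_W` and `compute_delta_C` are placeholders for the gradient computations derived in Eqs. (3.67)–(3.76) below.

```python
def train_layer_step(layer, compute_delta_W, compute_delta_C, eta1, eta2):
    """One iteration for one convolutional layer: update W^l first, then C^l."""
    # Step 1: gradient step on the full-precision filters W^l.
    layer.W = layer.W - eta1 * compute_delta_W(layer)
    # Step 2: with W^l now fixed, gradient step on the learnable matrix C^l.
    layer.C = layer.C - eta2 * compute_delta_C(layer)
    return layer
```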

Update W: Let $\delta_{W^l_i}$ be the gradient of the full-precision filter $W^l_i$. During backpropagation, the gradients are first passed to $\hat{W}^l_i$ and then to $W^l_i$. Thus,
\[
\delta_{W^l_i} = \frac{\partial L}{\partial W^l_i} = \frac{\partial L}{\partial \hat{W}^l_i} \, \frac{\partial \hat{W}^l_i}{\partial W^l_i}, \tag{3.67}
\]

where
\[
\frac{\partial \hat{W}^l_i}{\partial W^l_i} =
\begin{cases}
2 + 2W^l_i, & -1 \le W^l_i < 0, \\
2 - 2W^l_i, & 0 \le W^l_i < 1, \\
0, & \text{otherwise},
\end{cases} \tag{3.68}
\]
which is an approximation of $2\times$ the Dirac delta function [159].
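A direct NumPy transcription of this piecewise derivative might look as follows; the function name `approx_sign_grad` is an assumption used for illustration.

```python
import numpy as np


def approx_sign_grad(W):
    """Piecewise approximation of d(W_hat)/dW from Eq. (3.68): 2 + 2W on
    [-1, 0), 2 - 2W on [0, 1), and 0 elsewhere. Its integral over [-1, 1]
    is 2, i.e., it approximates 2x the Dirac delta function."""
    grad = np.zeros_like(W, dtype=float)
    neg = (W >= -1) & (W < 0)
    pos = (W >= 0) & (W < 1)
    grad[neg] = 2 + 2 * W[neg]
    grad[pos] = 2 - 2 * W[pos]
    return grad
```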

Furthermore,
\[
\frac{\partial L}{\partial \hat{W}^l_i} = \frac{\partial L_S}{\partial \hat{W}^l_i} + \frac{\partial L_{Kernel}}{\partial \hat{W}^l_i} + \frac{\partial L_{Adv}}{\partial \hat{W}^l_i}, \tag{3.69}
\]

and
\[
W^l_i \leftarrow W^l_i - \eta_1 \delta_{W^l_i}, \tag{3.70}
\]
where $\eta_1$ is the learning rate. Then,

\[
\frac{\partial L_{Kernel}}{\partial \hat{W}^l_i} = -\lambda_1 \left( W^l_i - C^l \hat{W}^l_i \right) C^l, \tag{3.71}
\]
\[
\frac{\partial L_{Adv}}{\partial \hat{W}^l_i} = -2 \left( 1 - D(T^l_i; Y) \right) \frac{\partial D}{\partial \hat{W}^l_i}. \tag{3.72}
\]
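Putting Eqs. (3.67)–(3.72) together, a possible NumPy sketch of the W-update for one filter is given below. It reuses `approx_sign_grad` from the earlier sketch; `grad_LS_wrt_What`, `D_value`, and `grad_D_wrt_What` stand in for the quantities supplied by the task loss $L_S$ and the discriminator $D$, and are assumptions for illustration.

```python
def update_W(W, W_hat, C, grad_LS_wrt_What, D_value, grad_D_wrt_What,
             lambda1, eta1):
    """One gradient step on a full-precision filter W^l_i (Eqs. (3.67)-(3.72)).

    grad_LS_wrt_What : dL_S / dW_hat, from the task loss
    D_value          : D(T^l_i; Y), the discriminator's output
    grad_D_wrt_What  : dD / dW_hat
    """
    # Eq. (3.71): kernel (reconstruction) term.
    grad_kernel = -lambda1 * (W - C * W_hat) * C
    # Eq. (3.72): adversarial term.
    grad_adv = -2.0 * (1.0 - D_value) * grad_D_wrt_What
    # Eq. (3.69): total gradient with respect to the binarized filter.
    grad_What = grad_LS_wrt_What + grad_kernel + grad_adv
    # Eq. (3.67): chain rule through the binarization, using Eq. (3.68).
    delta_W = grad_What * approx_sign_grad(W)
    # Eq. (3.70): SGD step with learning rate eta1.
    return W - eta1 * delta_W
```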

Update C: We further update the learnable matrix $C^l$ with $W^l$ fixed. Let $\delta_{C^l}$ be the gradient of $C^l$. Then we have
\[
\delta_{C^l} = \frac{\partial L_S}{\partial C^l} + \frac{\partial L_{Kernel}}{\partial C^l} + \frac{\partial L_{Adv}}{\partial C^l}, \tag{3.73}
\]

and
\[
C^l \leftarrow C^l - \eta_2 \delta_{C^l}, \tag{3.74}
\]
where $\eta_2$ is another learning rate. Furthermore,

\[
\frac{\partial L_{Kernel}}{\partial C^l} = -\lambda_1 \sum_i \left( W^l_i - C^l \hat{W}^l_i \right) \hat{W}^l_i, \tag{3.75}
\]
\[
\frac{\partial L_{Adv}}{\partial C^l} = -\sum_i 2 \left( 1 - D(T^l_i; Y) \right) \frac{\partial D}{\partial C^l}. \tag{3.76}
\]
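Analogously, a NumPy sketch of the C-update is shown below; the argument names (per-filter lists and the discriminator-related terms) are assumptions for illustration.

```python
def update_C(C, W_list, W_hat_list, grad_LS_wrt_C, D_values, grad_D_wrt_C_list,
             lambda1, eta2):
    """One gradient step on the shared matrix C^l (Eqs. (3.73)-(3.76)).

    W_list, W_hat_list : full-precision and binarized filters W^l_i, W_hat^l_i
    D_values           : discriminator outputs D(T^l_i; Y), one per filter i
    grad_D_wrt_C_list  : dD / dC^l, one term per filter i
    """
    # Eq. (3.75): kernel term, summed over the filters i of layer l.
    grad_kernel = -lambda1 * sum((W - C * W_hat) * W_hat
                                 for W, W_hat in zip(W_list, W_hat_list))
    # Eq. (3.76): adversarial term, summed over i.
    grad_adv = -sum(2.0 * (1.0 - d) * g
                    for d, g in zip(D_values, grad_D_wrt_C_list))
    # Eq. (3.73): total gradient, followed by the SGD step of Eq. (3.74).
    delta_C = grad_LS_wrt_C + grad_kernel + grad_adv
    return C - eta2 * delta_C
```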

These derivations show that the rectified process is trainable in an end-to-end manner. The complete training procedure, including how the discriminators are updated, is summarized in Algorithm 13. As described in line 17 of Algorithm 13, we update the other parameters independently while keeping the convolutional layers' parameters fixed, which enhances the variety of each layer's feature maps. In this way, we speed up training convergence and fully explore the potential of 1-bit networks. In our implementation, all values of $C^l$ are replaced by their average during the forward process, so only a scalar, rather than a matrix, is involved at inference time, which speeds up computation.
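To make the last point concrete, a minimal sketch of this inference-time simplification, assuming a PyTorch-style module like the `RBConv` sketch above, could be:

```python
import torch


@torch.no_grad()
def fold_C_for_inference(layer):
    """Replace the learnable matrix C^l by its scalar average, so inference
    only scales the binary filters by a single number instead of a matrix."""
    c_mean = layer.C.mean()  # scalar average of C^l
    w_hat = torch.where(layer.W >= 0, torch.ones_like(layer.W), -torch.ones_like(layer.W))
    layer.inference_weight = c_mean * w_hat  # effective 1-bit weights for deployment
    return layer
```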